{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example we will see how to perform topic extraction using MiniSom. The goal is to extract the main topics (represented as a set of words) that occur in a collection of documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "from minisom import MiniSom\n",
    "from sklearn.datasets import fetch_20newsgroups\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The colloction of documents that we will work with is the famous `20newsgroups` dataset. It contains more than 10000 newsgroups posts. We will download the dataset using sklearn and will transform the textual documents into a matrix `D` where each row represents a post using <a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer\">TF-IDF representation</a>:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = fetch_20newsgroups(shuffle=True, random_state=1,\n",
    "                             remove=('headers', 'footers', 'quotes'))\n",
    "documents = dataset.data\n",
    "\n",
    "no_features = 1000\n",
    "\n",
    "tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,\n",
    "                                   max_features=no_features,\n",
    "                                   stop_words='english')\n",
    "tfidf = tfidf_vectorizer.fit_transform(documents)\n",
    "tfidf_feature_names = tfidf_vectorizer.get_feature_names()\n",
    "D = tfidf.todense().tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have to train a SOM that clusters the documents, the total number of neurons in the SOM will be also the number of topics to extract:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "n_neurons = 2\n",
    "m_neurons = 4\n",
    "som = MiniSom(n_neurons, m_neurons, no_features)\n",
    "som.pca_weights_init(D)\n",
    "som.train(D, 40000, random_order=False, verbose=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will consider as topic the list of first `top_keywords` associated with the biggest weights of each neuron. With the following for loop we will inspect all the weights and recover the words associated with the weights using the feature names saved by the TfidfVectorizer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Topic 1 : steve low reported truth want knowledge shall right people don\n",
      "Topic 2 : use don just armenians turkey people like os turkish armenia\n",
      "Topic 3 : used help new buy x11 mail info thanks appreciated advance\n",
      "Topic 4 : read study event ideas writing religious learn ed religion alt\n",
      "Topic 5 : learn files point includes board email sound games home pc\n",
      "Topic 6 : light matter expect final deleted sure administration clinton stuff like\n",
      "Topic 7 : words generally clinton lots dod machines encryption like money jesus\n",
      "Topic 8 : report use cards clipper cable air 17 space 19 mode\n"
     ]
    }
   ],
   "source": [
    "top_keywords = 10\n",
    "\n",
    "weights = som.get_weights()\n",
    "cnt = 1\n",
    "for i in range(n_neurons):\n",
    "    for j in range(m_neurons):\n",
    "        keywords_idx = np.argsort(weights[i,j,:])[-top_keywords:]\n",
    "        keywords = ' '.join([tfidf_feature_names[k] for k in keywords_idx])\n",
    "        print('Topic', cnt, ':', keywords)\n",
    "        cnt += 1"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
